1.
BMC Bioinformatics ; 24(1): 41, 2023 Feb 08.
Article in English | MEDLINE | ID: mdl-36755242

ABSTRACT

BACKGROUND: Protein S-nitrosylation (SNO) plays a key role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as an important mechanism for regulating protein functions and cell signaling of all main classes of protein. It is involved in several biological processes, including immune response, protein stability, transcription regulation, post-translational regulation, DNA damage repair, and redox regulation, and is an emerging paradigm of redox signaling for protection against oxidative stress. The development of robust computational tools to predict protein SNO sites would contribute to further interpretation of the pathological and physiological mechanisms of SNO. RESULTS: Using an intermediate fusion-based stacked generalization approach, we integrated embeddings from a supervised embedding layer and a contextualized protein language model (ProtT5) and developed a tool called pLMSNOSite (protein language model-based SNO site predictor). On an independent test set of experimentally identified SNO sites, pLMSNOSite achieved values of 0.340, 0.735 and 0.773 for MCC, sensitivity and specificity, respectively. These results show that pLMSNOSite performs better than the compared approaches for the prediction of S-nitrosylation sites. CONCLUSION: Together, the experimental results suggest that pLMSNOSite achieves significant improvement in the prediction performance of S-nitrosylation sites and represents a robust computational approach for predicting protein S-nitrosylation sites. pLMSNOSite could be a useful resource for further elucidation of SNO and is publicly available at https://github.com/KCLabMTU/pLMSNOSite .
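The three figures quoted in this abstract (MCC, sensitivity, specificity) are standard functions of a binary confusion matrix. A minimal sketch of how they are computed follows; the function name `binary_metrics` and the counts are illustrative assumptions, not values or code from the paper.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Hypothetical counts chosen only for illustration.
metrics = binary_metrics(tp=70, fp=20, tn=80, fn=30)
```

Because MCC uses all four cells of the matrix, a test set with many more negatives than positives can pair fairly high sensitivity and specificity with a modest MCC.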


Subjects
Nitric Oxide , Proteins , Animals , Proteins/metabolism , Nitric Oxide/metabolism , Oxidation-Reduction , Protein Processing, Post-Translational , Signal Transduction
3.
Methods Mol Biol ; 2499: 285-322, 2022.
Article in English | MEDLINE | ID: mdl-35696087

ABSTRACT

Posttranslational modification (PTM) is a ubiquitous phenomenon in both eukaryotes and prokaryotes which gives rise to enormous proteomic diversity. PTM mostly comes in two flavors: covalent modification of the polypeptide chain and proteolytic cleavage. Understanding and characterization of PTM is a fundamental step toward understanding the underpinnings of biology. Recent advances in experimental approaches, mainly mass-spectrometry-based approaches, have immensely helped in obtaining and characterizing PTMs. However, experimental approaches are not enough to understand and characterize more than 450 different types of PTMs, and complementary computational approaches are becoming popular. Recently, due to the various advancements in the field of Deep Learning (DL), along with the explosion of applications of DL to various fields, the field of computational prediction of PTM has also witnessed the development of a plethora of DL-based approaches. In this book chapter, we first review some recent DL-based approaches in the field of PTM site prediction. In addition, we also review the recent advances in the less-studied PTM, that is, proteolytic cleavage prediction. We describe advances in PTM prediction by highlighting the deep learning architecture, feature encoding, novelty of the approaches, and availability of the tools/approaches. Finally, we provide an outlook and possible future research directions for DL-based approaches for PTM prediction.


Subjects
Deep Learning , Proteomics , Mass Spectrometry , Protein Processing, Post-Translational , Proteins/chemistry
4.
Sci Rep ; 12(1): 6541, 2022 Apr 21.
Article in English | MEDLINE | ID: mdl-35449168

ABSTRACT

In classical machine learning, regressors are trained without attempting to gain insight into the mechanism connecting inputs and outputs. The natural sciences, however, are interested in finding a robust, interpretable function for the target phenomenon that can return predictions even outside the training domain. This paper focuses on the viscosity prediction problem in steelmaking and proposes Einstein-Roscoe regression (ERR), which learns the coefficients of the Einstein-Roscoe equation and is able to extrapolate to unseen domains. In addition, it is often the case in the natural sciences that some measurements are unavailable or more expensive than others due to physical constraints. To address this, we employ a transfer learning framework based on Gaussian processes, which allows us to estimate the regression parameters using auxiliary measurements available at a reasonable cost. In experiments using viscosity measurements in a high-temperature slag suspension system, ERR compared favorably with various machine learning approaches in interpolation settings and outperformed all of them in extrapolation settings. Furthermore, after estimating parameters using the auxiliary dataset obtained at room temperature, an increase in accuracy is observed on the high-temperature dataset, which corroborates the effectiveness of the proposed approach.
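For orientation, the Einstein-Roscoe equation is commonly written as eta = eta0 * (1 - a*f)^(-n), where f is the solid volume fraction and a = 1.35, n = 2.5 are the values usually quoted for uniform spheres. The sketch below evaluates that form and recovers n from synthetic data by a log-linear least-squares fit with a held fixed. This simple fit is an assumption made for illustration only; it is not the Gaussian-process transfer learning procedure described in the abstract, and the function names are hypothetical.

```python
import math


def einstein_roscoe(eta0: float, f: float, a: float = 1.35, n: float = 2.5) -> float:
    """Suspension viscosity eta = eta0 * (1 - a*f)**(-n)."""
    if not 0.0 <= a * f < 1.0:
        raise ValueError("a * f must lie in [0, 1)")
    return eta0 * (1.0 - a * f) ** (-n)


def fit_exponent(fractions, viscosities, eta0: float, a: float = 1.35) -> float:
    """Least-squares estimate of n with a fixed.

    Uses log(eta / eta0) = n * (-log(1 - a*f)), a line through the origin.
    """
    xs = [-math.log(1.0 - a * f) for f in fractions]
    ys = [math.log(eta / eta0) for eta in viscosities]
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Synthetic, noise-free data generated from the equation itself.
fractions = [0.05, 0.10, 0.20, 0.30]
synthetic = [einstein_roscoe(1.0, f) for f in fractions]
n_hat = fit_exponent(fractions, synthetic, eta0=1.0)
```

Because the fitted quantities are the coefficients of a physical law rather than a generic regressor, the same function can be evaluated at solid fractions outside the fitted range, which is the extrapolation property the abstract emphasizes.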

5.
Mol Omics ; 16(5): 448-454, 2020 Oct 12.
Article in English | MEDLINE | ID: mdl-32555810

ABSTRACT

Methylation, which is one of the most prominent post-translational modifications on proteins, regulates many important cellular functions. Though several model-based methylation site predictors have been reported, all existing methods employ machine learning strategies, such as support vector machines and random forest, to predict sites of methylation based on a set of "hand-selected" features. As a consequence, the resulting models may be biased toward one set of features. Moreover, due to the large number of features, model development can often be computationally expensive. In this paper, we propose an alternative approach based on deep learning to predict arginine methylation sites. Our model, which we termed DeepRMethylSite, is computationally less expensive than traditional feature-based methods while eliminating potential biases that can arise through feature selection. Based on independent testing on our dataset, DeepRMethylSite achieved efficiency scores of 68%, 82% and 0.51 with respect to sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Importantly, in side-by-side comparisons with other state-of-the-art methylation site predictors, our method performs on par or better in all scoring metrics tested.


Subjects
Algorithms , Arginine/metabolism , Deep Learning , Protein Processing, Post-Translational , Proteins/metabolism , Databases, Protein , Methylation , Neural Networks, Computer , ROC Curve , Reproducibility of Results
6.
BMC Bioinformatics ; 21(Suppl 3): 63, 2020 Apr 23.
Article in English | MEDLINE | ID: mdl-32321437

ABSTRACT

BACKGROUND: Protein succinylation has recently emerged as an important and common post-translational modification (PTM) that occurs on lysine residues. Succinylation is notable both in its size (e.g., at 100 Da, it is one of the larger chemical PTMs) and in its ability to modify the net charge of the modified lysine residue from +1 to -1 at physiological pH. The gross local changes that occur in proteins upon succinylation have been shown to correspond with changes in gene activity and to be perturbed by defects in the citric acid cycle. These observations, together with the fact that succinate is generated as a metabolic intermediate during cellular respiration, have led to suggestions that protein succinylation may play a role in the interaction between cellular metabolism and important cellular functions. For instance, succinylation likely represents an important aspect of genomic regulation and repair and may have important consequences in the etiology of a number of disease states. In this study, we developed DeepSuccinylSite, a novel prediction tool that uses deep learning methodology along with embedding to identify succinylation sites in proteins based on their primary structure. RESULTS: Using an independent test set of experimentally identified succinylation sites, our method achieved efficiency scores of 79%, 68.7% and 0.48 for sensitivity, specificity and MCC, respectively, with an area under the receiver operating characteristic (ROC) curve of 0.8. In side-by-side comparisons with previously described succinylation predictors, DeepSuccinylSite represents a significant improvement in overall accuracy for prediction of succinylation sites. CONCLUSION: Together, these results suggest that our method represents a robust and complementary technique for advanced exploration of protein succinylation.


Subjects
Deep Learning , Protein Processing, Post-Translational , Proteins/metabolism , Succinates/metabolism , Binding Sites , Citric Acid Cycle , Lysine/metabolism , Proteins/chemistry
7.
Mol Omics ; 15(3): 189-204, 2019 Jun 01.
Article in English | MEDLINE | ID: mdl-31025681

ABSTRACT

Glutarylation, which is a newly identified posttranslational modification that occurs on lysine residues, has recently emerged as an important regulator of several metabolic and mitochondrial processes. However, the specific sites of modification on individual proteins, as well as the extent of glutarylation throughout the proteome, remain largely uncharacterized. Though informative, proteomic approaches based on mass spectrometry can be expensive, technically challenging and time-consuming. Therefore, the ability to predict glutarylation sites from protein primary sequences can complement proteomics analyses and help researchers study the characteristics and functional consequences of glutarylation. To this end, we used Random Forest (RF) machine learning strategies to identify the physicochemical and sequence-based features that correlated most substantially with glutarylation. We then used these features to develop a novel method to predict glutarylation sites from primary amino acid sequences using RF. Based on 10-fold cross-validation, the resulting algorithm, termed 'RF-GlutarySite', achieved efficiency scores of 75%, 81%, 68% and 0.50 with respect to accuracy (ACC), sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Likewise, using an independent test set, RF-GlutarySite exhibited ACC, SN, SP and MCC scores of 72%, 73%, 70% and 0.43, respectively. Results using both 10-fold cross-validation and an independent test set were on par with or better than those achieved by existing glutarylation site predictors. Notably, RF-GlutarySite achieved the highest SN score among available glutarylation site prediction tools. Consequently, our method has the potential to uncover new glutarylation sites and to facilitate the discovery of relationships between glutarylation and well-known lysine modifications, such as acetylation, methylation and SUMOylation, as well as a number of recently identified lysine modifications, such as malonylation and succinylation.


Subjects
Computational Biology/methods , Glutarates/metabolism , Proteomics/methods , Algorithms , Amino Acid Sequence , Amino Acids , Models, Chemical , Protein Conformation , Protein Processing, Post-Translational , Support Vector Machine
8.
IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 1844-1852, 2018.
Article in English | MEDLINE | ID: mdl-29990125

ABSTRACT

The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and functions of newly discovered NR proteins. Recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study, we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses spectrum-like features, namely Amino Acid Composition, Dipeptide Composition, and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria, RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the important features that are required to classify NR subfamilies. RF-NR is freely available at http://bcb.ncat.edu/RF_NR.
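The spectrum-like features named in this abstract are, in their usual definition, relative frequencies of length-1, length-2 and length-3 residue strings. A minimal sketch of that definition follows; the function name `kmer_composition` and the example sequence are made up for illustration, and the paper's exact normalization may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence: str, k: int) -> dict:
    """Relative frequency of every length-k string over the standard residues.

    k=1 gives amino acid composition (20 features), k=2 dipeptide
    composition (400 features), k=3 tripeptide composition (8000 features).
    Windows containing non-standard letters are skipped.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    valid = [w for w in windows if w in counts]
    for w in valid:
        counts[w] += 1
    total = len(valid)
    return {kmer: (c / total if total else 0.0) for kmer, c in counts.items()}


seq = "MKTAYIAKQR"  # made-up sequence, not a real nuclear receptor
aac = kmer_composition(seq, 1)
dpc = kmer_composition(seq, 2)
```

The resulting fixed-length vectors can be concatenated and passed to any classifier that expects tabular input, such as a random forest.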


Subjects
Computational Biology/methods , Receptors, Cytoplasmic and Nuclear/chemistry , Receptors, Cytoplasmic and Nuclear/classification , Algorithms , Databases, Protein , Machine Learning
9.
Article in English | MEDLINE | ID: mdl-28113600

ABSTRACT

Computing similarity or dissimilarity between protein structures is an important task in structural biology. A conventional method to compute protein structure dissimilarity requires structural alignment of the proteins. However, defining one best alignment is difficult, especially when the structures are very different. In this paper, we propose a new similarity measure for protein structure comparisons using a set of multi-view 2D images of 3D protein structures. In this approach, each protein structure is represented by a subspace from the image set. The similarity between two protein structures is then characterized by the canonical angles between the two subspaces. The primary advantage of our method is that precise alignment is not needed. We employed Grassmann Discriminant Analysis (GDA) as the subspace-based learning method in the classification framework. We applied our method to the classification of seven SCOP structural classes of protein 3D structures. The proposed method outperformed the k-nearest neighbor method (k-NN) based on the conventional alignment-based methods CE, FATCAT, and TM-align. Our method was also applied to the classification of SCOP folds of membrane proteins, where the proposed method could recognize the fold HEM-binding four-helical bundle (f.21) much better than TM-align.


Subjects
Computational Biology/methods , Protein Conformation , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Algorithms , Databases, Protein , Discriminant Analysis , Protein Folding , Proteins/classification
10.
BMC Bioinformatics ; 18(Suppl 16): 577, 2017 Dec 28.
Article in English | MEDLINE | ID: mdl-29297322

ABSTRACT

BACKGROUND: The β-Lactamase (BL) enzyme family is an important class of enzymes that plays a key role in bacterial resistance to antibiotics. As the number of newly identified BL enzymes is increasing daily, it is imperative to develop a computational tool to classify the newly identified BL enzymes into one of its classes. There are two types of classification of BL enzymes: Molecular Classification and Functional Classification. Existing computational methods only address Molecular Classification, and the performance of these existing methods is unsatisfactory. RESULTS: We addressed the unsatisfactory performance of the existing methods by implementing a Deep Learning approach called Convolutional Neural Network (CNN). We developed CNN-BLPred, an approach for the classification of BL proteins. CNN-BLPred uses Gradient Boosted Feature Selection (GBFS) in order to select the ideal feature set for each BL classification. Based on rigorous benchmarking of CNN-BLPred using both leave-one-out cross-validation and independent test sets, CNN-BLPred performed better than the other existing algorithms. Compared with other architectures of CNN, Recurrent Neural Network, and Random Forest, the simple CNN architecture with only one convolutional layer performs the best. After feature extraction, we were able to remove ~95% of the 10,912 features using Gradient Boosted Trees. During 10-fold cross-validation, we increased the accuracy of the classic BL predictions by 7%. We also increased the accuracy of Class A, Class B, Class C, and Class D performance by an average of 25.64%. The independent test results followed a similar trend. CONCLUSIONS: We implemented a deep learning algorithm known as Convolutional Neural Network (CNN) to develop a classifier for BL classification. Combined with feature selection on an exhaustive feature set and balancing methods such as Random Oversampling (ROS), Random Undersampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE), CNN-BLPred performs significantly better than existing algorithms for BL classification.
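Of the balancing methods listed in this abstract, Random Oversampling (ROS) is the simplest to state: minority-class samples are duplicated at random until the classes are the same size. The sketch below shows that general idea only; the function name, the seed handling and the toy labels are assumptions, and it is not the paper's pipeline.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sampling with replacement from the class's own members.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data: class "A" has four samples, classes "B" and "C" one each.
X = ["s1", "s2", "s3", "s4", "s5", "s6"]
y = ["A", "A", "A", "A", "B", "C"]
X_bal, y_bal = random_oversample(X, y)
class_counts = Counter(y_bal)
```

In a cross-validation setting, resampling of this kind is normally applied to the training portion of each fold only, so that duplicated samples do not leak into the evaluation data.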


Subjects
Algorithms , Neural Networks, Computer , beta-Lactamases/classification , Amino Acid Sequence , Databases, Protein , Models, Molecular , ROC Curve , Reproducibility of Results
11.
J Bioinform Comput Biol ; 14(5): 1644003, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27806683

ABSTRACT

Despite the accumulation of quantitative trait loci (QTL) data in many complex human diseases, most current approaches that have attempted to relate genotype to phenotype have achieved limited success, and the genetic factors of many common diseases remain to be elucidated. One of the reasons that makes this problem complex is the existence of single nucleotide polymorphism (SNP) interaction, or epistasis. Due to the excessive amount of computation required to search the combinatorial space, existing approaches cannot fully incorporate high-order SNP interactions into their models, but limit themselves to detecting only lower-order SNP interactions. We present an empirical approach based on ridge regression with polynomial kernels and a model selection technique for determining the true degree of epistasis among SNPs. Computer experiments on simulated data show the ability of the proposed method to correctly predict the number of interacting SNPs, provided that the number of samples is large enough relative to the number of SNPs. For cases in which the number of available samples is limited, we propose a sliding window approach to ensure a sufficiently large sample/SNP ratio in each window. In computational experiments using heterogeneous stock mice data, our approach successfully detected subregions that harbor known causal SNPs. Our analysis further suggests the existence of additional candidate causal SNPs interacting with each other in the neighborhood of the known causal gene. Software is available from https://github.com/HirotoSaigo/KDSNP .


Subjects
Polymorphism, Single Nucleotide , Software , Algorithms , Alkaline Phosphatase/genetics , Animals , Computer Simulation , Epistasis, Genetic , Gene Frequency , Genome-Wide Association Study/methods , Humans , Machine Learning , Mice, Inbred Strains , Models, Genetic , Regression Analysis
12.
J Chem Inf Model ; 55(12): 2519-27, 2015 Dec 28.
Article in English | MEDLINE | ID: mdl-26549421

ABSTRACT

Graph data are becoming increasingly common in machine learning and data mining, with applications extending to bioinformatics and cheminformatics. Accordingly, graph mining, a method for extracting patterns from graph data, has recently been studied and developed rapidly. Since the number of patterns in graph data is huge, a central issue is how to efficiently collect informative patterns suitable for subsequent tasks such as classification or regression. In this paper, we consider mining discriminative subgraphs from graph data with multiple labels. The resulting task has important applications in cheminformatics, such as finding common functional groups that trigger multiple drug side effects, or identifying ligand functional groups that hit multiple targets. In computational experiments, we first verify the effectiveness of the proposed approach on synthetic data, then we apply it to a drug adverse effect prediction problem. On the latter dataset, we compared the proposed method with L1-norm logistic regression in combination with the PubChem/Open Babel fingerprint; the proposed method showed superior performance with a much smaller number of subgraph patterns. Software is available from https://github.com/axot/GLP.


Subjects
Data Mining/methods , Models, Theoretical , Quantitative Structure-Activity Relationship , Algorithms , Data Mining/standards , Drug Discovery , Molecular Structure , Regression Analysis
13.
J Chem Inf Model ; 51(5): 1183-94, 2011 May 23.
Article in English | MEDLINE | ID: mdl-21506615

ABSTRACT

The identification of rules governing molecular recognition between drug chemical substructures and protein functional sites is a challenging issue at many stages of the drug development process. In this paper we develop a novel method to extract sets of drug chemical substructures and protein domains that govern drug-target interactions on a genome-wide scale. This is made possible using sparse canonical correspondence analysis (SCCA) for analyzing drug substructure profiles and protein domain profiles simultaneously. The method does not depend on the availability of protein 3D structures. From a data set of known drug-target interactions including enzymes, ion channels, G protein-coupled receptors, and nuclear receptors, we extract a set of chemical substructures shared by drugs able to bind to a set of protein domains. These two sets of extracted chemical substructures and protein domains form components that can be further exploited in a drug discovery process. This approach successfully clusters protein domains that may be evolutionarily unrelated but that bind a common set of chemical substructures. As shown in several examples, it can also be very helpful for predicting new protein-ligand interactions and addressing the problem of ligand specificity. The proposed method constitutes a contribution to the recent field of chemogenomics that aims to connect the chemical space with the biological space.


Subjects
Drug Design , Enzymes/chemistry , Ion Channels/chemistry , Receptors, Cytoplasmic and Nuclear/chemistry , Receptors, G-Protein-Coupled/chemistry , Algorithms , Binding Sites , Data Mining , Drug Discovery , Ligands , Protein Binding , Protein Interaction Domains and Motifs
14.
Stat Appl Genet Mol Biol ; 10: Article 6, 2011.
Article in English | MEDLINE | ID: mdl-21291416

ABSTRACT

Infections with the human immunodeficiency virus type 1 (HIV-1) are treated with combinations of drugs. Unfortunately, HIV responds to the treatment by developing resistance mutations. Consequently, the genome of the viral target proteins is sequenced and inspected for resistance mutations as part of routine diagnostic procedures for ensuring an effective treatment. For predicting response to a combination therapy, currently available computer-based methods rely on the genotype of the virus and the composition of the regimen as input. However, no available tool takes full advantage of the knowledge about the order of and the response to previously prescribed regimens. The resulting high-dimensional feature space makes existing methods difficult to apply in a straightforward fashion. The machine learning system proposed in this work, sequence boosting, is tailored to exploiting such high-dimensional information, i.e. the extraction of longitudinal features, by utilizing recent advancements in data mining and boosting. When applied to predicting the latest treatment outcome for 3,759 treatment-experienced patients from the EuResist integrated database, sequence boosting achieved superior performance compared to SVMs with RBF kernels. Moreover, sequence boosting allows easy access to the discriminative treatment information. Analysis of feature importance values provided by our model confirmed known facts regarding HIV treatment. For instance, application of potent and recently licensed drugs was beneficial for patients, and, conversely, the patient group that was subject to NRTI mono-therapies in the past had poor treatment prospects today. Furthermore, our model revealed novel biological insights. More precisely, the combination of previously used drugs with their in vivo response is more informative than the information of previously used drugs alone. Using this information improves the performance of systems for predicting therapy outcome.


Subjects
Anti-HIV Agents/therapeutic use , Artificial Intelligence , Data Mining/methods , Drug Resistance, Viral/genetics , HIV Infections/drug therapy , HIV-1/genetics , Computer Simulation , Data Interpretation, Statistical , Databases, Factual , Drug Therapy, Combination , Humans , Mutation , Treatment Outcome
15.
BMC Bioinformatics ; 11 Suppl 1: S31, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122204

ABSTRACT

BACKGROUND: Understanding secondary metabolic pathways in plants is essential for finding druggable candidate enzymes. However, there are many enzymes whose functions are not yet discovered in organism-specific metabolic pathways. Towards identifying the functions of those enzymes, assignment of EC numbers to the enzymatic reactions they catalyze plays a key role, since EC numbers represent the categorization of enzymes on one hand, and the categorization of enzymatic reactions on the other hand. RESULTS: We propose reaction graph kernels for automatically assigning EC numbers to unknown enzymatic reactions in a metabolic network. Reaction graph kernels compute the similarity between two chemical reactions by considering the similarity of the chemical compounds in the reactions and their relationships. In computational experiments based on the KEGG/REACTION database, our method successfully predicted the first three digits of the EC number with 83% accuracy. We also exhaustively predicted missing EC numbers in plant secondary metabolism pathways. The prediction results of reaction graph kernels on 36 unknown enzymatic reactions are compared with an expert's knowledge. Using the same data for evaluation, we compared our method with E-zyme, and showed its ability to assign a larger number of accurate EC numbers. CONCLUSION: Reaction graph kernels are a new metric for comparing enzymatic reactions.


Subjects
Computational Biology/methods , Enzymes/metabolism , Plants/metabolism , Databases, Factual
16.
Bioinformatics ; 23(18): 2455-62, 2007 Sep 15.
Article in English | MEDLINE | ID: mdl-17698858

ABSTRACT

MOTIVATION: Human immunodeficiency virus type 1 (HIV-1) evolves in the human body, and its exposure to a drug often causes mutations that enhance the resistance against the drug. To design an effective pharmacotherapy for an individual patient, it is important to accurately predict the drug resistance based on genotype data. Notably, the resistance is not just the simple sum of the effects of all mutations. Structural biological studies suggest that the association of mutations is crucial: even if mutation A or B alone does not affect the resistance, a significant change might happen when the two mutations occur together. Linear regression methods cannot take the associations into account, while decision tree methods can reveal only limited associations. Kernel methods and neural networks implicitly use all possible associations for prediction, but cannot select salient associations explicitly. RESULTS: Our method, itemset boosting, performs linear regression in the complete space of power sets of mutations. It implements a forward feature selection procedure where, in each iteration, one mutation combination is found by an efficient branch-and-bound search. This method uses all possible combinations, and salient associations are explicitly shown. In experiments, our method worked particularly well for predicting the resistance of nucleotide reverse transcriptase inhibitors (NRTIs). Furthermore, it successfully recovered many mutation associations known in biological literature. AVAILABILITY: http://www.kyb.mpg.de/bs/people/hiroto/iboost/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
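The idea of regressing on mutation combinations with forward selection can be illustrated with a toy sketch. Everything below is an assumption made for illustration: it enumerates combinations exhaustively up to a small size instead of using the branch-and-bound search described in the abstract, it fits each selected feature stagewise by least squares, and the mutation names and resistance values are invented.

```python
from itertools import combinations


def indicator(itemset, genotype) -> float:
    """1.0 if the genotype carries every mutation in the itemset, else 0.0."""
    return 1.0 if itemset <= genotype else 0.0


def greedy_itemset_regression(genotypes, targets, max_size=2, rounds=2):
    """Stagewise linear regression on mutation-combination indicators.

    Each round adds the combination whose indicator column most reduces
    the squared error of the current residual.
    """
    mutations = sorted(set().union(*genotypes))
    candidates = [frozenset(c) for k in range(1, max_size + 1)
                  for c in combinations(mutations, k)]
    bias = sum(targets) / len(targets)
    residual = [t - bias for t in targets]
    model = {}
    for _ in range(rounds):
        best, best_gain, best_w = None, 0.0, 0.0
        for items in candidates:
            col = [indicator(items, g) for g in genotypes]
            support = sum(col)
            if support == 0:
                continue
            dot = sum(c * r for c, r in zip(col, residual))
            gain = dot * dot / support  # squared-error reduction
            if gain > best_gain + 1e-12:
                best, best_gain, best_w = items, gain, dot / support
        if best is None:
            break
        model[best] = model.get(best, 0.0) + best_w
        residual = [r - best_w * indicator(best, g)
                    for r, g in zip(residual, genotypes)]
    return bias, model


# Invented data: resistance appears only when A and B occur together.
genotypes = [set(), {"A"}, {"B"}, {"A", "B"}, {"A", "B", "C"}, {"C"}]
resistance = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
bias, model = greedy_itemset_regression(genotypes, resistance)
first_selected = next(iter(model))
```

On this toy data the pair {A, B} is selected before either single mutation, which mirrors the abstract's point that an association can matter even when the individual mutations do not.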


Subjects
Antiviral Agents/administration & dosage , DNA Mutational Analysis/methods , Drug Resistance, Viral/genetics , Genotype , HIV-1/drug effects , HIV-1/genetics , Information Storage and Retrieval/methods , Algorithms , Base Sequence , Databases, Genetic , HIV-1/classification , Molecular Sequence Data
17.
Article in English | MEDLINE | ID: mdl-17048398

ABSTRACT

Many biomedical problems relate to mutant functional properties across a sequence space of interest, e.g., flu, cancer, and HIV. Detailed knowledge of mutant properties and function improves medical treatment and prevention. A functional census of p53 cancer rescue mutants would aid the search for cancer treatments from p53 mutant rescue. We devised a general methodology for conducting a functional census of a mutation sequence space by choosing informative mutants early. The methodology was tested in a double-blind predictive test on the functional rescue property of 71 novel putative p53 cancer rescue mutants iteratively predicted in sets of three (24 iterations). The first double-blind 15-point moving accuracy was 47 percent and the last was 86 percent; r = 0.01 before an epiphanic 16th iteration and r = 0.92 afterward. Useful mutants were chosen early (overall r = 0.80). Code and data are freely available (http://www.igb.uci.edu/research/research.html, corresponding authors: R.H.L. for computation and R.K.B. for biology).


Subjects
Computational Biology/methods , Models, Statistical , Mutation/genetics , Tumor Suppressor Protein p53/genetics , Artificial Intelligence , Binding Sites/genetics , Humans , Internet , Models, Molecular , Mutation/physiology , Mutation, Missense/genetics , Mutation, Missense/physiology , Neoplasms/drug therapy , Neoplasms/genetics , Protein Folding , Protein Structure, Tertiary , ROC Curve , Suppression, Genetic/genetics , Suppression, Genetic/physiology , Surface Properties , Tumor Suppressor Protein p53/chemistry
18.
BMC Bioinformatics ; 7: 246, 2006 May 05.
Article in English | MEDLINE | ID: mdl-16677385

ABSTRACT

BACKGROUND: Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different. RESULTS: Contrary to the local alignment score computed by the Smith-Waterman algorithm, the local alignment kernel is differentiable with respect to the amino acid substitution and its derivative can be computed efficiently by dynamic programming. We optimized the substitution matrix by classical gradient descent by setting an objective function that measures how well the local alignment kernel discriminates homologs from non-homologs in the COG database. The local alignment kernel exhibits better performance when it uses the matrices and gap parameters optimized by this procedure than when it uses the matrices optimized for the Smith-Waterman algorithm. Furthermore, the matrices and gap parameters optimized for the local alignment kernel can also be used successfully by the Smith-Waterman algorithm. CONCLUSION: This optimization procedure leads to useful substitution matrices, both for the local alignment kernel and the Smith-Waterman algorithm. The best performance for homology detection is obtained by the local alignment kernel.
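For readers unfamiliar with the baseline this abstract contrasts against, the Smith-Waterman local alignment score is a short dynamic program. The sketch below uses a linear gap penalty and a toy match/mismatch function, both of which are simplifying assumptions; real use would plug in a BLOSUM or PAM matrix and typically affine gap parameters.

```python
def smith_waterman_score(a: str, b: str, substitution, gap: int = -4) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,                                               # start a new alignment
                prev[j - 1] + substitution(a[i - 1], b[j - 1]),  # align two residues
                prev[j] + gap,                                   # gap in b
                curr[j - 1] + gap,                               # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def toy_substitution(x: str, y: str) -> int:
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


score = smith_waterman_score("KKHEAGKK", "WWHEAGWW", toy_substitution)
```

The score is a maximum over alignments, so it changes only piecewise as substitution values vary. The abstract notes that the local alignment kernel, by contrast, is differentiable with respect to the substitution matrix, which is what allows the gradient-based matrix optimization described there.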


Subjects
Algorithms , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Amino Acid Substitution , Artificial Intelligence , Conserved Sequence , Molecular Sequence Data , Pattern Recognition, Automated/methods , Sequence Homology, Amino Acid
19.
Proteins ; 62(3): 617-29, 2006 Mar 15.
Article in English | MEDLINE | ID: mdl-16320312

ABSTRACT

The formation of disulphide bridges between cysteines plays an important role in protein folding, structure, function, and evolution. Here, we develop new methods for predicting disulphide bridges in proteins. We first build a large curated data set of proteins containing disulphide bridges to extract relevant statistics. We then use kernel methods to predict whether a given protein chain contains intrachain disulphide bridges or not, and recursive neural networks to predict the bonding probabilities of each pair of cysteines in the chain. These probabilities in turn lead to an accurate estimation of the total number of disulphide bridges and to a weighted graph matching problem that can be addressed efficiently to infer the global disulphide bridge connectivity pattern. This approach can be applied both in situations where the bonded state of each cysteine is known and in ab initio mode, where the state is unknown. Furthermore, it can easily cope with chains containing an arbitrary number of disulphide bridges, overcoming one of the major limitations of previous approaches. It can classify individual cysteine residues as bonded or nonbonded with 87% specificity and 89% sensitivity. The estimate for the total number of bridges in each chain is correct 71% of the time, and within one of the true value over 94% of the time. The prediction of the overall disulphide connectivity pattern is exact in about 51% of the chains. In addition to using profiles in the input to leverage evolutionary information, including true (but not predicted) secondary structure and solvent accessibility information yields small but noticeable improvements. Finally, once the system is trained, predictions can be computed rapidly on a proteomic or protein-engineering scale. The disulphide bridge prediction server (DIpro), software, and datasets are available through www.igb.uci.edu/servers/psss.html.
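
The step from pairwise bonding probabilities to a connectivity pattern can be sketched as a small matching problem. The following is a brute-force illustration, not the efficient algorithm the abstract refers to: the function name `best_connectivity`, the probability table and the choice to pass the number of bridges in as an argument are all assumptions. It enumerates sets of disjoint cysteine pairs and keeps the set with the highest total probability.

```python
def best_connectivity(prob, n_cys, n_bridges):
    """Return (score, pairs) for the best set of n_bridges disjoint cysteine pairs.

    prob maps (i, j) with i < j to a predicted bonding probability.
    Exhaustive search: only practical for a handful of cysteines.
    """
    best = (float("-inf"), ())

    def extend(free, chosen, total):
        nonlocal best
        if len(chosen) == n_bridges:
            if total > best[0]:
                best = (total, tuple(chosen))
            return
        # Not enough unpaired cysteines left to reach n_bridges.
        if len(free) < 2 * (n_bridges - len(chosen)):
            return
        first, rest = free[0], free[1:]
        # Option 1: leave the first free cysteine unbonded.
        extend(rest, chosen, total)
        # Option 2: pair it with each remaining free cysteine in turn.
        for k, other in enumerate(rest):
            p = prob.get((first, other), 0.0)
            extend(rest[:k] + rest[k + 1:], chosen + [(first, other)], total + p)

    extend(list(range(n_cys)), [], 0.0)
    return best


# Hypothetical probabilities for a chain with four cysteines (indices 0 to 3).
example_prob = {
    (0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
    (1, 2): 0.3, (1, 3): 0.2, (2, 3): 0.8,
}
example_score, example_pairs = best_connectivity(example_prob, n_cys=4, n_bridges=2)
```

Because cysteines may be left unpaired, the same routine covers chains in which only some cysteines are bonded; for realistic chain lengths a polynomial-time maximum weight matching algorithm would replace the enumeration.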


Subjects
Disulfides/chemistry , Proteins/chemistry , Amino Acid Sequence , Cysteine , Databases, Protein , Models, Molecular , Probability , Protein Conformation , Sequence Alignment , Sequence Homology, Amino Acid
20.
Protein Sci ; 14(11): 2804-13, 2005 Nov.
Article in English | MEDLINE | ID: mdl-16251364

ABSTRACT

As the number of complete genomes rapidly increases, accurate methods to automatically predict the subcellular location of proteins are increasingly useful for their functional annotation. To improve on the predictive accuracy of the many prediction methods developed to date, a novel representation of protein sequences is proposed. This representation involves local compositions of amino acids and twin amino acids, and local frequencies of distance between successive (basic, hydrophobic, and other) amino acids. For calculating the local features, each sequence is split into three parts: N-terminal, middle, and C-terminal. The N-terminal part is further divided into four regions to account for ambiguity in the length and position of signal sequences. We tested this representation with support vector machines on two data sets extracted from the SWISS-PROT database. In fivefold cross-validation tests, overall accuracies of more than 87% and 91% were obtained for eukaryotic and prokaryotic proteins, respectively. It is concluded that considering the respective features in the N-terminal, middle, and C-terminal parts is helpful for predicting the subcellular location.
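
The idea of computing composition separately for the N-terminal, middle and C-terminal parts can be sketched as follows. This covers only the single amino acid composition; the twin amino acid and distance-frequency features, and the further division of the N-terminal part into four regions, are left out. The segment lengths `n_len` and `c_len` and the function names are assumptions, since the abstract does not give them.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Fraction of each of the 20 standard amino acids in a segment."""
    counts = Counter(segment)
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / n for a in AMINO_ACIDS]


def local_composition(seq, n_len=40, c_len=20):
    """Concatenate compositions of the N-terminal, middle and C-terminal parts.

    n_len and c_len are illustrative defaults, not values from the paper.
    """
    n_part = seq[:n_len]
    rest = seq[n_len:]
    middle = rest[:max(0, len(rest) - c_len)]
    c_part = rest[len(middle):]
    return composition(n_part) + composition(middle) + composition(c_part)
```

Each sequence then maps to a fixed-length vector of 60 values, which is the kind of input a support vector machine expects regardless of sequence length.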


Subjects
Artificial Intelligence , Proteins/analysis , Sequence Analysis, Protein/methods , Amino Acids, Basic/chemistry , Hydrophobic and Hydrophilic Interactions , Plant Proteins/analysis , Plant Proteins/chemistry , Protein Sorting Signals , Proteins/chemistry , Reproducibility of Results